Runbook: Node Not Ready
Alert
- Prometheus Alert: KubeNodeNotReady / KubeNodeUnreachable
- Grafana Dashboard: Cluster Health dashboard
- Firing condition: A Kubernetes node has been in NotReady state for more than 5 minutes
Severity
Critical -- A NotReady node means pods scheduled on that node may be unreachable or evicted. In a 3-node cluster (1 server + 2 agents), losing a node reduces capacity by 33-50% and may affect pod scheduling and HA guarantees.
Impact
- Pods on the affected node become unreachable
- DaemonSet pods (Alloy, NeuVector enforcer, node-exporter) stop reporting from that node
- Pod disruption budgets may prevent rescheduling if capacity is tight
- If the affected node is the RKE2 server (control plane), the Kubernetes API may become unavailable
- NeuVector enforcer loses runtime visibility on the affected node
Investigation Steps
- Check node status:
kubectl get nodes -o wide
- Describe the not-ready node for condition details:
kubectl describe node <node-name>
- Look at the conditions section for specific failures:
kubectl get node <node-name> -o json | jq '.status.conditions'
- Check if the node is reachable via SSH:
ssh sre-admin@<node-ip> "uptime && free -h && df -h"
- If SSH is available, check kubelet status:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-agent"
# Or for server nodes:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-server"
- Check RKE2 service and kubelet logs on the node:
ssh sre-admin@<node-ip> "sudo journalctl -u rke2-agent --no-pager --since '30 minutes ago' | tail -100"
ssh sre-admin@<node-ip> "sudo tail -100 /var/lib/rancher/rke2/agent/logs/kubelet.log"
- Check for disk pressure:
ssh sre-admin@<node-ip> "df -h && df -i"
- Check for memory pressure:
ssh sre-admin@<node-ip> "free -h && grep -E 'MemTotal|MemAvailable|SwapTotal' /proc/meminfo"
- Check for PID pressure (compare the process count against the kernel limit):
ssh sre-admin@<node-ip> "ps aux | wc -l && cat /proc/sys/kernel/pid_max"
- Check containerd (RKE2 runs containerd as a child process, not a standalone systemd unit):
ssh sre-admin@<node-ip> "pgrep -fa containerd"
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps"
- Check system logs for hardware or kernel errors:
ssh sre-admin@<node-ip> "sudo dmesg | tail -50"
ssh sre-admin@<node-ip> "sudo journalctl -p err --since '1 hour ago' --no-pager"
- Check pods that were running on the not-ready node:
kubectl get pods -A --field-selector spec.nodeName=<node-name>
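The condition checks above can be wrapped in a small helper that flags only problem conditions. This is a sketch; the sample input is a simplified Type/Status listing (not raw `kubectl describe` output), and `summarize_conditions` is a hypothetical name:

```shell
# Sketch: flag problem conditions from a "Type Status" listing.
# In practice, extract the Type/Status columns from `kubectl describe node`
# and pipe them in; the here-doc below is a hypothetical sample.
summarize_conditions() {
  # Ready should be True; the pressure conditions should be False.
  awk '
    $1 == "Ready"              && $2 != "True" { print "PROBLEM: node not Ready (status=" $2 ")" }
    $1 ~ /Pressure$/           && $2 == "True" { print "PROBLEM: " $1 " is True" }
    $1 == "NetworkUnavailable" && $2 == "True" { print "PROBLEM: NetworkUnavailable is True" }
  '
}

summarize_conditions <<'EOF'
MemoryPressure False
DiskPressure True
PIDPressure False
Ready False
EOF
# Prints:
#   PROBLEM: DiskPressure is True
#   PROBLEM: node not Ready (status=False)
```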
Resolution
kubelet/RKE2 service stopped
- Restart the RKE2 service:
# For agent nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-agent"
# For server nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-server"
- Wait 1-2 minutes and verify the node returns to Ready:
kubectl get node <node-name> -w
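As an alternative to watching interactively, the verification step can be scripted. This is a sketch: `node_is_ready` parses the STATUS column of `kubectl get nodes --no-headers` output, and note that real statuses can carry suffixes such as `Ready,SchedulingDisabled`:

```shell
# Sketch: decide readiness from one line of `kubectl get nodes --no-headers`.
node_is_ready() {
  # $1: a "NAME STATUS ROLES AGE VERSION" line
  [ "$(echo "$1" | awk '{print $2}')" = "Ready" ]
}

# Against a real cluster (uncomment), poll for ~2 minutes:
# for i in $(seq 1 24); do
#   line=$(kubectl get node <node-name> --no-headers)
#   node_is_ready "$line" && { echo "node is Ready"; break; }
#   sleep 5
# done

node_is_ready "agent-1 Ready <none> 42d v1.28.9+rke2r1" && echo yes   # -> yes
node_is_ready "agent-2 NotReady <none> 42d v1.28.9+rke2r1" || echo no # -> no
```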
Disk pressure
- Identify large files or directories:
ssh sre-admin@<node-ip> "sudo du -sh /var/log/* | sort -rh | head -10"
ssh sre-admin@<node-ip> "sudo du -sh /var/lib/rancher/rke2/* | sort -rh | head -10"
- Clean up container images:
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock rmi --prune"
- Rotate and compress old logs:
ssh sre-admin@<node-ip> "sudo journalctl --vacuum-size=500M"
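To tie the disk checks to the 80%/90% thresholds listed under Prevention, a small filter over `df` output can highlight only the filesystems that matter. A sketch, with a hypothetical sample; feed it real `df -h --output=pcent,target` output instead:

```shell
# Sketch: classify filesystems against the 80% (warning) / 90% (critical)
# disk thresholds. Expects "Use% Mounted" columns, header on line 1.
classify_usage() {
  awk 'NR > 1 {
    pct = $1 + 0                        # "92%" -> 92 (numeric prefix)
    if (pct >= 90)      print "CRITICAL " $2 " at " $1
    else if (pct >= 80) print "WARNING " $2 " at " $1
  }'
}

classify_usage <<'EOF'
Use% Mounted
92% /var/lib/rancher
81% /var
40% /
EOF
# Prints:
#   CRITICAL /var/lib/rancher at 92%
#   WARNING /var at 81%
```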
Memory pressure
- Check for pods consuming excessive memory:
kubectl top pods -A --sort-by=memory | head -20
- If a specific pod is the cause, check its memory limits and consider adjusting the HelmRelease values
- If the memory pressure is system-level, check for non-Kubernetes processes:
ssh sre-admin@<node-ip> "ps aux --sort=-%mem | head -20"
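The kubelet sets the MemoryPressure condition based on available memory, so it helps to express the numbers as a percentage. A sketch computing available-memory percent from `/proc/meminfo` fields; the values in the here-doc are hypothetical (pipe in the real file on the node):

```shell
# Sketch: available memory as a percentage of total, from /proc/meminfo.
# Usage on a node: cat /proc/meminfo | mem_available_pct
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }
  '
}

mem_available_pct <<'EOF'
MemTotal:       8000000 kB
MemAvailable:    400000 kB
EOF
# Prints: 5
```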
Network connectivity issues
- Check if the node can reach the API server:
ssh sre-admin@<node-ip> "curl -k https://127.0.0.1:6443/healthz"
- Check firewall rules:
ssh sre-admin@<node-ip> "sudo firewall-cmd --list-all"
- Verify required ports are open (RKE2 uses 6443, 9345, 10250, and 2379-2380; Canal, the default RKE2 CNI, also needs 8472/UDP for VXLAN)
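The port checks can be run from another node with bash's built-in `/dev/tcp` redirect (no extra tooling needed). A sketch; `<node-ip>` is a placeholder:

```shell
# Sketch: TCP reachability check using bash's /dev/tcp, 2s timeout per port.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open $host:$port"
  else
    echo "closed $host:$port"
  fi
}

# Against the affected node (uncomment):
# for p in 6443 9345 10250 2379 2380; do check_port <node-ip> "$p"; done
```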
Node completely unresponsive
- If SSH is not available, attempt console access via Proxmox:
# From a machine with Proxmox access
ssh root@<proxmox-host> "qm status <vmid>"
- If the VM is stopped, start it:
ssh root@<proxmox-host> "qm start <vmid>"
- If the VM is running but unresponsive, force reset:
ssh root@<proxmox-host> "qm reset <vmid>"
- After the node comes back, verify it rejoins the cluster:
kubectl get nodes -w
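The Proxmox decision above (stopped → start, running-but-unresponsive → reset) can be captured as a small mapping. A sketch: the `status: ...` output format is an assumption, so verify it against your Proxmox version before relying on it:

```shell
# Sketch: pick a recovery action from `qm status <vmid>` output.
# Assumes the output looks like "status: running" or "status: stopped".
recovery_action() {
  case "$1" in
    "status: stopped") echo "qm start <vmid>" ;;
    "status: running") echo "VM up but node unresponsive -> qm reset <vmid>" ;;
    *)                 echo "unexpected status: $1 -- investigate manually" ;;
  esac
}

recovery_action "status: stopped"   # -> qm start <vmid>
```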
Cordon and drain (if node needs maintenance)
- Cordon the node to prevent new pods:
kubectl cordon <node-name>
- Drain existing pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s
- Perform maintenance
- Uncordon when ready:
kubectl uncordon <node-name>
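The cordon/drain steps above can be wrapped with a dry-run mode so the sequence can be rehearsed before touching the node. A sketch; the `DRY_RUN` convention and default node name are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: cordon/drain wrapper with dry-run (default). Set DRY_RUN=0 to execute.
set -euo pipefail

NODE="${1:-node-name}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s
echo "perform maintenance, then: kubectl uncordon $NODE"
```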
Prevention
- Monitor node conditions via the kube_node_status_condition metric in Prometheus
- Set disk usage alerts at 80% (warning) and 90% (critical)
- Set memory usage alerts at 85% (warning) and 95% (critical)
- Configure log rotation on all nodes via Ansible (/etc/logrotate.d/)
- Ensure the RKE2 service is enabled on boot: systemctl enable rke2-agent (or systemctl enable rke2-server on server nodes)
- Maintain at least 3 worker nodes for pod scheduling redundancy
- Run periodic CIS benchmark scans via NeuVector to catch drift
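The disk thresholds above can be expressed as a Prometheus alerting rule against node-exporter metrics. A sketch of the warning tier only; the group/alert names and label filters are illustrative, not existing config:

```yaml
# Sketch: 80% disk-usage warning from node-exporter filesystem metrics.
groups:
  - name: node-capacity
    rules:
      - alert: NodeDiskUsageWarning
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

A second rule with `> 90` and `severity: critical` covers the critical tier.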
Escalation
- If the RKE2 server (control plane) node is not ready: this is a P1 incident -- the Kubernetes API may be unavailable
- If multiple nodes are not ready simultaneously: investigate shared infrastructure (network switch, storage, hypervisor)
- If the node cannot rejoin the cluster after restart: the node may need to be re-provisioned using Ansible